PERF: fix SparseArray._simple_new object initialization #32821

jorisvandenbossche · 2020-03-19T07:26:36Z

Apart from being more idiomatic, this also avoids creating a SparseArray through the normal machinery (including validation of the input, etc.) just for an empty list.
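To illustrate the pattern described above, here is a minimal sketch using a toy class, not the actual pandas code. The class `Validated`, its `init_calls` counter, and both method names are hypothetical; the only assumption carried over from the PR is that the old code built a throwaway instance from an empty list, while the new code allocates the object without running `__init__`.

```python
class Validated:
    """Toy stand-in for a class whose __init__ does costly input validation."""

    init_calls = 0  # counts how often the full constructor runs

    def __init__(self, values):
        type(self).init_calls += 1
        # Stand-in for the "normal machinery": coerce and validate the input.
        self._values = [float(v) for v in values]

    @classmethod
    def _simple_new_slow(cls, values):
        # Runs the full constructor on an empty list only to obtain an
        # instance, then overwrites its attributes.
        new = cls([])
        new._values = values
        return new

    @classmethod
    def _simple_new_fast(cls, values):
        # Allocates the instance directly; __init__ is never called.
        new = object.__new__(cls)
        new._values = values
        return new


if __name__ == "__main__":
    Validated._simple_new_slow([1.0, 2.0])
    Validated._simple_new_fast([1.0, 2.0])
    print(Validated.init_calls)  # only the slow path went through __init__
```

The fast path is only safe when the caller guarantees the inputs are already valid, which is the usual contract for a private `_simple_new`-style constructor.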

With this PR:

In [1]: data = np.array([1, 2, 3], dtype=float)  

In [2]: index = pd.core.arrays.sparse.IntIndex(5, np.array([0, 2, 4]))  

In [3]: dtype = pd.SparseDtype("float64", 0)      

In [4]: pd.arrays.SparseArray._simple_new(data, index, dtype)  
Out[4]: 
[1.0, 0, 2.0, 0, 3.0]
Fill: 0
IntIndex
Indices: array([0, 2, 4], dtype=int32)

In [5]: %timeit pd.arrays.SparseArray._simple_new(data, index, dtype)    
381 ns ± 4.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

while on the released version this takes around 50 µs (roughly 100x slower)

Noticed while investigating #32196

rth · 2020-03-19T09:49:27Z

Thanks @jorisvandenbossche! Quick benchmark results below from running pd.DataFrame.sparse.from_spmatrix on a random sparse CSR matrix with the given n_samples and n_features and density=0.01:

label                 master (s)    PR (s)
n_samples n_features                   
100       100000         14.5374   10.1247
10000     10000           1.5599    1.1037
100000    100             0.0180    0.0134

So overall this makes that method around 30% faster (the PR timings are roughly 26-30% lower than master across the three shapes).
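A benchmark of this kind could be sketched roughly as follows. This is not the script used for the numbers above; the helper names `make_csr` and `time_from_spmatrix`, the use of `scipy.sparse.random`, and the best-of-N timing scheme are all assumptions made for illustration.

```python
import time

import pandas as pd
import scipy.sparse as sp


def make_csr(n_samples, n_features, density=0.01):
    """Build a random sparse CSR matrix (hypothetical helper)."""
    return sp.random(n_samples, n_features, density=density, format="csr")


def time_from_spmatrix(n_samples, n_features, density=0.01, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    X = make_csr(n_samples, n_features, density=density)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        pd.DataFrame.sparse.from_spmatrix(X)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Smallest shape from the table above; the wider shapes take much longer.
    n_samples, n_features = 100_000, 100
    seconds = time_from_spmatrix(n_samples, n_features)
    print(f"{n_samples} x {n_features}: {seconds:.4f} s")
```

Running the same script against master and against the PR branch gives comparable timings; absolute values depend on the machine and on the pandas and SciPy versions installed.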

simonjayhawkins

Thanks @jorisvandenbossche lgtm

jorisvandenbossche · 2020-03-19T11:44:30Z

@rth thanks for the timings! Yes, this was indeed a large part of the original slowness of from_spmatrix; the snippet in the issue addresses most of the rest.

jreback · 2020-03-19T11:45:58Z

Do we have asvs for this? Also, please add a whatsnew note.

rth · 2020-03-19T12:02:46Z

Note added in #32825; that should work for both PRs, I think.


PERF: fix SparseArray._simple_new object initialization

3ffe7f2

jorisvandenbossche added the Performance (Memory or execution speed performance) and Sparse (Sparse Data Type) labels Mar 19, 2020

jorisvandenbossche added this to the 1.1 milestone Mar 19, 2020

simonjayhawkins approved these changes Mar 19, 2020


simonjayhawkins merged commit 34f3360 into pandas-dev:master Mar 19, 2020

jorisvandenbossche deleted the sparse-simple-new branch March 19, 2020 11:43

rth mentioned this pull request Mar 19, 2020

PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

Merged

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

PERF: fix SparseArray._simple_new object initialization (pandas-dev#3…

29cc056

…2821)

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020

PERF: fix SparseArray._simple_new object initialization (pandas-dev#3…

022ec63

…2821)

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020

PERF: fix SparseArray._simple_new object initialization (pandas-dev#3…

ebeb6bc

…2821)
